Modular architecture for entity normalization

ABSTRACT

A system and method for identifying duplicate objects from a plurality of objects. The system and method group similar objects into buckets based on a selected grouper, match objects within the same bucket based on a selected matcher, and identify the matching objects as duplicate objects.

CROSS-REFERENCE TO RELATED PATENT APPLICATIONS

This application is related to the following U.S. Applications, all of which are incorporated by reference herein:

-   U.S. application Ser. No. 11/357,748, entitled “Support for Object Search”, filed Feb. 17, 2006;
-   U.S. application Ser. No. 11/342,290, entitled “Data Object Visualization”, filed on Jan. 27, 2006;
-   U.S. application Ser. No. 11/342,293, entitled “Data Object Visualization Using Maps”, filed on Jan. 27, 2006;
-   U.S. application Ser. No. 11/356,679, entitled “Query Language”, filed Feb. 17, 2006;
-   U.S. application Ser. No. 11/356,837, entitled “Automatic Object Reference Identification and Linking in a Browseable Fact Repository”, filed Feb. 17, 2006;
-   U.S. application Ser. No. 11/356,851, entitled “Browseable Fact Repository”, filed Feb. 17, 2006;
-   U.S. application Ser. No. 11/356,842, entitled “ID Persistence Through Normalization”, filed Feb. 17, 2006;
-   U.S. application Ser. No. 11/356,728, entitled “Annotation Framework”, filed Feb. 17, 2006;
-   U.S. application Ser. No. 11/341,069, entitled “Object Categorization for Information Extraction”, filed on Jan. 27, 2006;
-   U.S. application Ser. No. 11/356,765, entitled “Attribute Entropy as a Signal in Object Normalization”, filed Feb. 17, 2006;
-   U.S. application Ser. No. 11/341,907, entitled “Designating Data Objects for Analysis”, filed on Jan. 27, 2006; and
-   U.S. application Ser. No. 11/342,277, entitled “Data Object Visualization Using Graphs”, filed on Jan. 27, 2006.

TECHNICAL FIELD

The disclosed embodiments relate generally to fact databases. More particularly, the disclosed embodiments relate to identifying duplicate objects in an object collection.

BACKGROUND

Data is often organized as large collections of objects. When the objects are added over time, there are often problems with data duplication. For example, a collection may include multiple objects that represent the same entity. As used herein, the term “duplicate objects,” or any variation thereof, is intended to cover objects representing the same entity. Duplicate objects are not necessarily identical; they can have different facts or different values of the same facts.

Duplicate objects are undesirable for many reasons. They increase storage cost and take a longer time to process. They lead to inaccurate results, such as an inaccurate count of distinct objects. They also cause data inconsistency. For example, subsequent operations affecting only some of the duplicate objects cause objects representing the same entity to be inconsistent.

Traditional approaches to identify duplicate objects assume a homogeneity in the input set (all books, all products, all movies, etc.), and compare different facts of objects to identify duplication for objects of different types. For example, when identifying duplicate objects in a set of objects representing books, traditional approaches match the ISBN value of the objects; and when identifying duplicate objects in objects representing people, traditional approaches match the SSN value of the objects. One drawback of the traditional approaches is that they are only effective for specific types of objects, and tend to be ineffective when applied to a collection of objects with different types. Also, even if the objects in the collection are of the same type, these approaches are ineffective when the objects include incomplete or inaccurate information.

For these reasons, what is needed is a method and system that identifies duplicate objects in a large number of objects having different types and/or incomplete information.

SUMMARY

The invention is a system and method for identifying duplicate objects from a plurality of objects. Objects are grouped into buckets using a selected grouper. Objects within the same bucket are compared to each other using a selected matcher to identify duplicate objects. The grouper and the matcher are selected from a collection of groupers and matchers. This approach is computationally cost-efficient because objects are pair-wise matched only within a bucket, rather than pair-wise matched across all buckets. This approach can identify duplicate objects from objects with different types, and incomplete and/or inaccurate information by selecting groupers and matchers designed to handle such scenarios.
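
To make the cost argument concrete, the following sketch counts pair-wise comparisons with and without bucketing. The object count and bucket sizes are hypothetical, chosen only for illustration, and the function names are not part of the disclosure.

```python
from math import comb


def pairwise_comparisons(n):
    """Number of unordered pairs among n objects (n choose 2)."""
    return comb(n, 2)


def bucketed_comparisons(bucket_sizes):
    """Pairs compared when matching is restricted to objects sharing a bucket."""
    return sum(comb(size, 2) for size in bucket_sizes)


# Hypothetical collection: 1,000,000 objects spread evenly over
# 100,000 buckets of 10 objects each.
print(pairwise_comparisons(1_000_000))        # 499999500000
print(bucketed_comparisons([10] * 100_000))   # 4500000
```

Under these assumed sizes, bucketing reduces the number of matcher invocations by roughly five orders of magnitude. The actual saving depends on how evenly the selected grouper spreads the objects.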

One method for identifying duplicate objects is as follows. A grouper is selected from a collection of groupers to apply to the objects and generate a signature for each of the objects. Objects sharing a same signature are grouped into the same bucket. A matcher is selected from a collection of matchers to match objects within the same bucket. Matching objects are determined to be duplicate objects.

These features are not the only features of the invention. In view of the drawings, specification, and claims, many additional features and advantages will be apparent.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a network, in accordance with a preferred embodiment of the invention.

FIGS. 2(a)-2(d) are block diagrams illustrating a data structure for facts within a repository of FIG. 1 in accordance with preferred embodiments of the invention.

FIG. 2(e) is a block diagram illustrating an alternate data structure for facts and objects in accordance with preferred embodiments of the invention.

FIG. 3 is a flowchart of an exemplary method for identifying duplicate objects in accordance with a preferred embodiment of the invention.

FIG. 4 is a simplified diagram illustrating an object being processed for identification of its duplicate objects in accordance with a preferred embodiment of the invention.

FIGS. 5(a)-(e) illustrate an example of identifying duplicate objects, in accordance with a preferred embodiment of the invention.

The figures depict various embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

System Architecture

FIG. 1 shows a system architecture 100 adapted to support one embodiment of the invention. FIG. 1 shows components used to add facts into, and retrieve facts from, a repository 115. The system architecture 100 includes a network 104, through which any number of document hosts 102 communicate with a data processing system 106, along with any number of object requesters 152, 154.

Document hosts 102 store documents and provide access to documents. A document is comprised of any machine-readable data including any combination of text, graphics, multimedia content, etc. A document may be encoded in a markup language, such as Hypertext Markup Language (HTML), i.e., a web page, in an interpreted language (e.g., JavaScript) or in any other computer readable or executable format. A document can include one or more hyperlinks to other documents. A typical document will include one or more facts within its content. A document stored in a document host 102 may be located and/or identified by a Uniform Resource Locator (URL), or Web address, or any other appropriate form of identification and/or location. A document host 102 is implemented by a computer system, and typically includes a server adapted to communicate over the network 104 via networking protocols (e.g., TCP/IP), as well as application and presentation protocols (e.g., HTTP, HTML, SOAP, D-HTML, Java). The documents stored by a host 102 are typically held in a file directory, a database, or other data repository. A host 102 can be implemented in any computing device (e.g., from a PDA or personal computer, a workstation, mini-computer, or mainframe, to a cluster or grid of computers), as well as in any processor architecture or operating system.

FIG. 1 shows components used to manage facts in a fact repository 115. Data processing system 106 includes one or more importers 108, one or more janitors 110, a build engine 112, a service engine 114, and a fact repository 115 (also called simply a “repository”). Each of the foregoing is implemented, in one embodiment, as software modules (or programs) executed by processor 116. Importers 108 operate to process documents received from the document hosts, read the data content of documents, and extract facts (as operationally and programmatically defined within the data processing system 106) from such documents. The importers 108 also determine the subject or subjects with which the facts are associated, and extract such facts into individual items of data, for storage in the fact repository 115. In one embodiment, there are different types of importers 108 for different types of documents, for example, dependent on the format or document type.

Janitors 110 operate to process facts extracted by importer 108. This processing can include, but is not limited to, data cleansing, object merging, and fact induction. In one embodiment, there are a number of different janitors 110 that perform different types of data management operations on the facts. For example, one janitor 110 may traverse some set of facts in the repository 115 to find duplicate facts (that is, facts that convey the same factual information) and merge them. Another janitor 110 may also normalize facts into standard formats. Another janitor 110 may also remove unwanted facts from repository 115, such as facts related to pornographic content. Other types of janitors 110 may be implemented, depending on the types of data management functions desired, such as translation, compression, spelling or grammar correction, and the like.

Various janitors 110 act on facts to normalize attribute names and values, and delete duplicate and near-duplicate facts so an object does not have redundant information. For example, we might find on one page that Britney Spears' birthday is “12/2/1981” while on another page that her date of birth is “Dec. 2, 1981.” Birthday and Date of Birth might both be rewritten as Birthdate by one janitor and then another janitor might notice that 12/2/1981 and Dec. 2, 1981 are different forms of the same date. It would choose the preferred form, remove the other fact and combine the source lists for the two facts. As a result, when you look at the source pages for this fact, on some you'll find an exact match of the fact and on others text that is considered to be synonymous with the fact.
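
The two janitor steps in this example can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the synonym table, the two accepted date formats, the use of an ISO date string as the preferred form, and the placeholder source names are all assumptions. Parsing the month abbreviation with `%b` also assumes an English locale.

```python
from datetime import datetime

# Hypothetical synonym table built from the attribute names in the example.
ATTRIBUTE_SYNONYMS = {"birthday": "birthdate", "date of birth": "birthdate"}

# Assumed input formats: month/day/year and abbreviated month name.
DATE_FORMATS = ("%m/%d/%Y", "%b. %d, %Y")


def normalize_attribute(name):
    """Map synonymous attribute names onto one canonical name."""
    key = name.strip().lower()
    return ATTRIBUTE_SYNONYMS.get(key, key)


def normalize_date(text):
    """Return an ISO date string if text parses as a date, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def merge_duplicate_facts(facts):
    """Merge facts that agree after normalization, combining their sources.

    facts is a list of (attribute, value, source) triples.
    """
    merged = {}
    for attribute, value, source in facts:
        key = (normalize_attribute(attribute), normalize_date(value) or value)
        merged.setdefault(key, []).append(source)
    return merged


facts = [
    ("Birthday", "12/2/1981", "page-a"),
    ("Date of Birth", "Dec. 2, 1981", "page-b"),
]
print(merge_duplicate_facts(facts))
# {('birthdate', '1981-12-02'): ['page-a', 'page-b']}
```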

Build engine 112 builds and manages the repository 115. Service engine 114 is an interface for querying the repository 115. Service engine 114's main function is to process queries, score matching objects, and return them to the caller, but it is also used by janitor 110.

Repository 115 stores factual information extracted from a plurality of documents that are located on document hosts 102. A document from which a particular fact may be extracted is a source document (or “source”) of that particular fact. In other words, a source of a fact includes that fact (or a synonymous fact) within its contents.

Repository 115 contains one or more facts. In one embodiment, each fact is associated with exactly one object. One implementation for this association includes in each fact an object ID that uniquely identifies the object of the association. In this manner, any number of facts may be associated with an individual object, by including the object ID for that object in the facts. In one embodiment, objects themselves are not physically stored in the repository 115, but rather are defined by the set or group of facts with the same associated object ID, as described below. Further details about facts in repository 115 are described below, in relation to FIGS. 2(a)-2(d).

It should be appreciated that in practice at least some of the components of the data processing system 106 will be distributed over multiple computers, communicating over a network. For example, repository 115 may be deployed over multiple servers. As another example, the janitors 110 may be located on any number of different computers. For convenience of explanation, however, the components of the data processing system 106 are discussed as though they were implemented on a single computer.

In another embodiment, some or all of document hosts 102 are located on data processing system 106 instead of being coupled to data processing system 106 by a network. For example, importer 108 may import facts from a database that is a part of or associated with data processing system 106.

FIG. 1 also includes components to access repository 115 on behalf of one or more object requesters 152, 154. Object requesters are entities that request objects from repository 115. Object requesters 152, 154 may be understood as clients of the system 106, and can be implemented in any computer device or architecture. As shown in FIG. 1, a first object requester 152 is located remotely from system 106, while a second object requester 154 is located in data processing system 106. For example, in a computer system hosting a blog, the blog may include a reference to an object whose facts are in repository 115. An object requester 152, such as a browser displaying the blog, will access data processing system 106 so that the information of the facts associated with the object can be displayed as part of the blog web page. As a second example, janitor 110 or other entity considered to be part of data processing system 106 can function as object requester 154, requesting the facts of objects from repository 115.

FIG. 1 shows that data processing system 106 includes a memory 107 and one or more processors 116. Memory 107 includes importers 108, janitors 110, build engine 112, service engine 114, and requester 154, each of which is preferably implemented as instructions stored in memory 107 and executable by processor 116. Memory 107 also includes repository 115. Repository 115 can be stored in a memory of one or more computer systems or in a type of memory such as a disk. FIG. 1 also includes a computer readable medium 118 containing, for example, at least one of importers 108, janitors 110, build engine 112, service engine 114, requester 154, and at least some portions of repository 115. FIG. 1 also includes one or more input/output devices 120 that allow data to be input and output to and from data processing system 106. It will be understood that data processing system 106 preferably also includes standard software components such as operating systems and the like and further preferably includes standard hardware components not shown in the figure for clarity of example.

Data Structure

FIG. 2(a) shows an example format of a data structure for facts within repository 115, according to some embodiments of the invention. As described above, the repository 115 includes facts 204. Each fact 204 includes a unique identifier for that fact, such as a fact ID 210. Each fact 204 includes at least an attribute 212 and a value 214. For example, a fact associated with an object representing George Washington may include an attribute of “date of birth” and a value of “Feb. 22, 1732.” In one embodiment, all facts are stored as alphanumeric characters since they are extracted from web pages. In another embodiment, facts also can store binary data values. Other embodiments, however, may store fact values as mixed types, or in encoded formats.

As described above, each fact is associated with an object ID 209 that identifies the object that the fact describes. Thus, each fact that is associated with a same entity (such as George Washington), will have the same object ID 209. In one embodiment, objects are not stored as separate data entities in memory. In this embodiment, the facts associated with an object contain the same object ID, but no physical object exists. In another embodiment, objects are stored as data entities in memory, and include references (for example, pointers or IDs) to the facts associated with the object. The logical data structure of a fact can take various forms; in general, a fact is represented by a tuple that includes a fact ID, an attribute, a value, and an object ID. The storage implementation of a fact can be in any underlying physical data structure.
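
The fact tuple, and the embodiment in which an object exists only as the set of facts sharing an object ID, might be sketched as follows. The class name, field names, ID values, and the name fact are illustrative assumptions; only the “date of birth” fact comes from the example above.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    """Logical fact tuple: fact ID, attribute, value, and object ID."""
    fact_id: int
    attribute: str
    value: str
    object_id: str


def objects_from_facts(facts):
    """Reconstruct objects by grouping facts that share an object ID.

    No separate object record is stored; an object is simply the list
    of facts carrying its object ID.
    """
    objects = defaultdict(list)
    for fact in facts:
        objects[fact.object_id].append(fact)
    return dict(objects)


# Hypothetical IDs; the name fact is assumed for illustration.
facts = [
    Fact(1, "name", "George Washington", "obj-1"),
    Fact(2, "date of birth", "Feb. 22, 1732", "obj-1"),
]
objects = objects_from_facts(facts)
print([f.attribute for f in objects["obj-1"]])  # ['name', 'date of birth']
```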

FIG. 2(b) shows an example of facts having respective fact IDs of 10, 20, and 30 in repository 115. Facts 10 and 20 are associated with an object identified by object ID “1.” Fact 10 has an attribute of “Name” and a value of “China.” Fact 20 has an attribute of “Category” and a value of “Country.” Thus, the object identified by object ID “1” has a name fact 205 with a value of “China” and a category fact 206 with a value of “Country.” Fact 30 208 has an attribute of “Property” and a value of “Bill Clinton was the 42nd President of the United States from 1993 to 2001.” Thus, the object identified by object ID “2” has a property fact with a fact ID of 30 and a value of “Bill Clinton was the 42nd President of the United States from 1993 to 2001.” In the illustrated embodiment, each fact has one attribute and one value. The number of facts associated with an object is not limited; thus while only two facts are shown for the “China” object, in practice there may be dozens, even hundreds of facts associated with a given object. Also, the value fields of a fact need not be limited in size or content. For example, a fact about the economy of “China” with an attribute of “Economy” would have a value including several paragraphs of text, numbers, perhaps even tables of figures. This content can be formatted, for example, in a markup language. For example, a fact having an attribute “original html” might have a value of the original html text taken from the source web page.

Also, while the illustration of FIG. 2(b) shows the explicit coding of object ID, fact ID, attribute, and value, in practice the content of the fact can be implicitly coded as well (e.g., the first field being the object ID, the second field being the fact ID, the third field being the attribute, and the fourth field being the value). Other fields include but are not limited to: the language used to state the fact (English, etc.), how important the fact is, the source of the fact, a confidence value for the fact, and so on.

FIG. 2(c) shows an example object reference table 210 that is used in some embodiments. Not all embodiments include an object reference table. The object reference table 210 functions to efficiently maintain the associations between object IDs and fact IDs. In the absence of an object reference table 210, it is also possible to find all facts for a given object ID by querying the repository to find all facts with a particular object ID. While FIGS. 2(b) and 2(c) illustrate the object reference table 210 with explicit coding of object and fact IDs, the table also may contain just the ID values themselves in column or pair-wise arrangements.
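
The two lookup paths described here, through an object reference table or by scanning the repository, can be sketched with the facts of FIG. 2(b). The dictionary layout and function names are assumptions made for illustration; the disclosure does not prescribe a storage format.

```python
# Facts of FIG. 2(b), keyed by fact ID. The dict layout is an assumption.
FACTS = {
    10: {"object_id": "1", "attribute": "Name", "value": "China"},
    20: {"object_id": "1", "attribute": "Category", "value": "Country"},
    30: {"object_id": "2", "attribute": "Property",
         "value": "Bill Clinton was the 42nd President of the "
                  "United States from 1993 to 2001."},
}


def build_object_reference_table(facts):
    """Map each object ID to the list of fact IDs associated with it."""
    table = {}
    for fact_id, fact in facts.items():
        table.setdefault(fact["object_id"], []).append(fact_id)
    return table


def fact_ids_via_table(table, object_id):
    """Direct lookup in the object reference table."""
    return table.get(object_id, [])


def fact_ids_via_scan(facts, object_id):
    """Fallback without a table: query every fact for a matching object ID."""
    return [fid for fid, fact in facts.items()
            if fact["object_id"] == object_id]


table = build_object_reference_table(FACTS)
print(table)                          # {'1': [10, 20], '2': [30]}
print(fact_ids_via_scan(FACTS, "1"))  # [10, 20]
```

The table trades extra storage for a direct lookup, while the scan touches every fact on each request.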

FIG. 2(d) shows an example of a data structure for facts within repository 115, according to some embodiments of the invention showing an extended format of facts. In this example, the fields include an object reference link 216 to another object. The object reference link 216 can be an object ID of another object in the repository 115, or a reference to the location (e.g., table row) for the object in the object reference table 210. The object reference link 216 allows facts to have as values other objects. For example, for an object “United States,” there may be a fact with the attribute of “president” and the value of “George W. Bush,” with “George W. Bush” being an object having its own facts in repository 115. In some embodiments, the value field 214 stores the name of the linked object and the link 216 stores the object identifier of the linked object. Thus, this “president” fact would include the value 214 of “George W. Bush”, and object reference link 216 that contains the object ID for the “George W. Bush” object. In some other embodiments, facts 204 do not include a link field 216 because the value 214 of a fact 204 may store a link to another object.

Each fact 204 also may include one or more metrics 218. A metric provides an indication of some quality of the fact. In some embodiments, the metrics include a confidence level and an importance level. The confidence level indicates the likelihood that the fact is correct. The importance level indicates the relevance of the fact to the object, compared to other facts for the same object. The importance level may optionally be viewed as a measure of how vital a fact is to an understanding of the entity or concept represented by the object.

Each fact 204 includes a list of one or more sources 220 that include the fact and from which the fact was extracted. Each source may be identified by a Uniform Resource Locator (URL), or Web address, or any other appropriate form of identification and/or location, such as a unique document identifier.

The facts illustrated in FIG. 2(d) include an agent field 222 that identifies the importer 108 that extracted the fact. For example, the importer 108 may be a specialized importer that extracts facts from a specific source (e.g., the pages of a particular web site, or family of web sites) or type of source (e.g., web pages that present factual information in tabular form), or an importer 108 that extracts facts from free text in documents throughout the Web, and so forth.

Some embodiments include one or more specialized facts, such as a name fact 207 and a property fact 208. A name fact 207 is a fact that conveys a name for the entity or concept represented by the object ID. A name fact 207 includes an attribute 224 of “name” and a value, which is the name of the object. For example, for an object representing the country Spain, a name fact would have the value “Spain.” A name fact 207, being a special instance of a general fact 204, includes the same fields as any other fact 204; it has an attribute, a value, a fact ID, metrics, sources, etc. The attribute 224 of a name fact 207 indicates that the fact is a name fact, and the value is the actual name. The name may be a string of characters. An object ID may have one or more associated name facts, as many entities or concepts can have more than one name. For example, an object ID representing Spain may have associated name facts conveying the country's common name “Spain” and the official name “Kingdom of Spain.” As another example, an object ID representing the U.S. Patent and Trademark Office may have associated name facts conveying the agency's acronyms “PTO” and “USPTO” as well as the official name “United States Patent and Trademark Office.” If an object does have more than one associated name fact, one of the name facts may be designated as a primary name and other name facts may be designated as secondary names, either implicitly or explicitly.

A property fact 208 is a fact that conveys a statement about the entity or concept represented by the object ID. Property facts are generally used for summary information about an object. A property fact 208, being a special instance of a general fact 204, also includes the same parameters (such as attribute, value, fact ID, etc.) as other facts 204. The attribute field 226 of a property fact 208 indicates that the fact is a property fact (e.g., attribute is “property”) and the value is a string of text that conveys the statement of interest. For example, for the object ID representing Bill Clinton, the value of a property fact may be the text string “Bill Clinton was the 42nd President of the United States from 1993 to 2001.” Some object IDs may have one or more associated property facts while other objects may have no associated property facts. It should be appreciated that the data structures shown in FIGS. 2(a)-2(d) and described above are merely exemplary. The data structure of the repository 115 may take on other forms. Other fields may be included in facts and some of the fields described above may be omitted. Additionally, each object ID may have additional special facts aside from name facts and property facts, such as facts conveying a type or category (for example, person, place, movie, actor, organization, etc.) for categorizing the entity or concept represented by the object ID. In some embodiments, an object's name(s) and/or properties may be represented by special records that have a different format than the general facts records 204.

As described previously, a collection of facts is associated with an object ID of an object. An object may become a null or empty object when facts are disassociated from the object. A null object can arise in a number of different ways. One type of null object is an object that has had all of its facts (including name facts) removed, leaving no facts associated with its object ID. Another type of null object is an object that has all of its associated facts other than name facts removed, leaving only its name fact(s). Alternatively, the object may be a null object only if all of its associated name facts are removed. A null object represents an entity or concept for which the data processing system 106 has no factual information and, as far as the data processing system 106 is concerned, does not exist. In some embodiments, facts of a null object may be left in the repository 115, but have their object ID values cleared (or have their importance set to a negative value). However, the facts of the null object are treated as if they were removed from the repository 115. In some other embodiments, facts of null objects are physically removed from repository 115.

FIG. 2(e) is a block diagram illustrating an alternate data structure 290 for facts and objects in accordance with preferred embodiments of the invention. In this data structure, an object 290 contains an object ID 292 and references or points to facts 294. Each fact includes a fact ID 295, an attribute 297, and a value 299. In this embodiment, an object 290 actually exists in memory 107.

Overview of Methodology

In one embodiment, the present invention is implemented in a janitor 110 to identify duplicate objects so that the duplicate objects can be merged together. The janitor 110 examines the object reference table 210, and reconstructs the objects based on the associations between object IDs and fact IDs maintained in the object reference table 210. Alternatively, the janitor 110 can retrieve objects by asking the service engine 114 for the information stored in the repository 115. Depending on how object information is stored in the repository 115, the janitor 110 needs to reconstruct the objects based on the facts and object information retrieved.

Referring to FIG. 3, there is shown a flowchart of an exemplary method for identifying duplicate objects according to one embodiment of the present invention. The process illustrated in FIG. 3 may be implemented in software, hardware, or a combination of hardware and software.

The flowchart shown in FIG. 3 will now be described in detail, illustrated by the diagram in FIG. 4 and the example in FIGS. 5(a)-(e). The process commences with a set of objects 430 that may contain duplicate objects. For example, there may be multiple objects that represent the entity “George Washington.” Each object 430 has a set of facts. As illustrated in FIG. 2(a), each fact 204 has an attribute 212 and a value 214 (also called fact value). An example of the set of objects 430 is shown in FIG. 5(a).

As shown in FIG. 5(a), objects O1 and O3 are duplicate objects representing the same entity, a Mr. John M. Doe with nickname D. J. O1 is associated with three facts with the following attributes: name, phone number, and type. O3 is associated with four facts: name, phone number, type, and birthday. Objects O2 and O4 are duplicate objects representing a book titled The Relativity. O2 is associated with three facts: name, ISBN, and year of publication. O4 is associated with three facts: name, type, and ISBN. Object O5 represents a race horse named John Henry. O5 is associated with four facts: name, type, birthday, and trainer. Among the duplicate objects, there are considerable variations in the associated facts. A preferred embodiment of the present invention can be used on collections of objects numbering from tens of thousands, to millions, or more.

Referring to FIGS. 3 and 4, the janitor 110 applies 310 a grouper 410 to each object. The grouper 410 groups similar objects into buckets 460 such that if duplicate objects exist, they are included in the same bucket. It will be understood that non-duplicate objects will also be in the same bucket, but in any case, the large number of objects will be spread out among multiple buckets 460.

As illustrated in FIG. 4, when processing an object 430, the grouper 410 calls a signature generator 440 to generate a signature 450 based on the facts associated with the object 430. The signature generator 440 is designed to generate an identical signature for duplicate objects even if the facts associated with the objects are not duplicates. The signature generator 440 as shown in FIG. 4 is part of the grouper 410, but it can also be a separate function/module. The grouper 410 then puts the object 430 into an existing bucket 460 indexed by the signature 450. If there is no such bucket then a new bucket 460 is created, the signature 450 is assigned as the index of the bucket 460, and the object 430 is put into the bucket 460. When all objects 430 are processed by the grouper 410, those objects sharing a signature are in the same bucket.

It is noted that the signature generated by the signature generator 440 can be a null signature, a signature with an empty value. The grouper 410 does not place an object with a null signature into any bucket. As a result, objects with null signatures are neither compared nor merged with other objects. The signature generator 440 can generate a null signature because the object is not associated with necessary facts. Alternatively, the signature generator 440 can purposefully generate a null signature for certain objects to prevent the objects from being considered for merger.
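
The bucketing procedure, including the null-signature rule, can be sketched as follows. This is a minimal illustration under stated assumptions: objects are modeled as dictionaries mapping attribute to value, the phone-number signature generator is hypothetical, and the name values are assumed since FIG. 5(a) is not reproduced here.

```python
def phone_signature(facts):
    """Hypothetical signature generator: phone number without white space.

    Returns None (a null signature) when the necessary fact is missing.
    """
    phone = facts.get("phone number")
    if not phone:
        return None
    return "".join(phone.split())


def group_into_buckets(objects, signature_generator):
    """Place each object into the bucket indexed by its signature."""
    buckets = {}
    for object_id, facts in objects.items():
        signature = signature_generator(facts)
        if not signature:
            continue  # null signature: the object is placed in no bucket
        if signature not in buckets:
            buckets[signature] = []  # create a new bucket for this signature
        buckets[signature].append(object_id)
    return buckets


# Assumed fact values for three of the example objects.
objects = {
    "O1": {"name": "John Doe", "phone number": "(703) 123-4567"},
    "O3": {"name": "D.J.", "phone number": "(703) 123-4567"},
    "O5": {"name": "John Henry"},  # no phone number fact
}
print(group_into_buckets(objects, phone_signature))
# {'(703)123-4567': ['O1', 'O3']}
```

O5 receives a null signature under this generator, so it is neither compared nor merged with any other object.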

In one example, the grouper 410 groups objects 430 based on the associated type value. A type value is the value of a fact with attribute type. If an object 430 has a type value of “human,” the signature generator 440 generates the signature 450 based on the associated phone number value. A phone number value is the value of a fact with attribute phone number. If an object 430 has a type value of “book,” the signature generator 440 generates the signature 450 based on the associated ISBN value. An ISBN value is the value of a fact with attribute ISBN. Otherwise, the signature generator 440 generates the signature 450 based on the name value. A name value is the value of a fact with attribute name. The grouper 410 then places the object 430 into a bucket 460 in accordance with the signature 450.

In one embodiment, the signature generator 440 generates the signature 450 by concatenating the fact values selected and removing any white space in the concatenated string.

FIG. 5(b) shows the fact value used by the above grouper 410 to generate a signature for each object. As described above, depending on the type value of the object 430, the fact value used by the grouper 410 to generate the signature 450 for the object 430 varies. FIG. 5(c) shows in which buckets the objects are ultimately placed. Applying the above grouper 410, objects O1 and O3 are properly grouped into a bucket indexed by a signature 450 based on “(703) 123-4567,” the phone number value of both objects. Objects O2 and O4 are placed in a bucket indexed by a signature 450 based on “Relativity” and a bucket indexed by a signature 450 based on “0517884410,” respectively. Even though O2 and O4 represent the same entity, the signature generator 440 generates different signatures for them. Because no fact with attribute type is associated with O2, the signature generator 440 generates the signature 450 for O2 based on the associated name value. The type value of O4 is “book,” thus the signature generator 440 generates the signature 450 for O4 based on the associated ISBN value. Because the grouper 410 groups objects based on the associated signature, O2 and O4 are placed into different buckets. O5 is grouped into a bucket indexed by a signature 450 based on “John Henry,” the associated name value.
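
The type-based grouper and its outcome for the five example objects can be sketched as follows. The fact values are assumptions reconstructed from the signatures quoted in the text, because FIG. 5(a) is not reproduced here; in particular, the type value of O5 and the ISBN of O2 are hypothetical, and facts that play no role in the signature are omitted.

```python
def type_based_signature(facts):
    """Pick the signature fact by type: phone for humans, ISBN for books,
    name otherwise. White space is removed from the selected value."""
    type_value = facts.get("type")
    if type_value == "human":
        attribute = "phone number"
    elif type_value == "book":
        attribute = "ISBN"
    else:
        attribute = "name"
    value = facts.get(attribute)
    if not value:
        return None  # null signature: the necessary fact is missing
    return "".join(value.split())


def group_into_buckets(objects, signature_generator):
    """Group object IDs by signature, skipping null signatures."""
    buckets = {}
    for object_id, facts in objects.items():
        signature = signature_generator(facts)
        if signature:
            buckets.setdefault(signature, []).append(object_id)
    return buckets


# Assumed fact values (see the lead-in above).
objects = {
    "O1": {"name": "John Doe", "phone number": "(703) 123-4567",
           "type": "human"},
    "O2": {"name": "Relativity", "ISBN": "0517884410"},  # no type fact
    "O3": {"name": "D.J.", "phone number": "(703) 123-4567",
           "type": "human"},
    "O4": {"name": "The Relativity", "type": "book", "ISBN": "0517884410"},
    "O5": {"name": "John Henry", "type": "horse"},  # hypothetical type value
}
for signature, members in group_into_buckets(
        objects, type_based_signature).items():
    print(signature, members)
# (703)123-4567 ['O1', 'O3']
# Relativity ['O2']
# 0517884410 ['O4']
# JohnHenry ['O5']
```

As in the text, O1 and O3 share a bucket, while O2 falls back to its name value for lack of a type fact and is separated from O4.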

Alternatively, the grouper 410 can group objects solely based on the associated name values. In one example, the signature generator 440 applies some normalization rules to the associated name value to standardize the name value before generating the signature 450. Examples of the normalization rules include removal of punctuation, such as removing commas in a string; conversion of uppercase characters in a string to corresponding lowercase characters, such as from “America” to “america”; and stop word removal, such as removing “the” and “is” from a string.

FIG. 5(d) shows the name value used by the above grouper 410 to generate a signature for each object shown in FIG. 5(a). FIG. 5(e) shows in which buckets the objects are ultimately placed. Applying the above grouper 410, objects O2 and O4 are properly grouped into a bucket indexed by a signature 450 based on “relativity,” the normalized name value of both objects, while O1 and O3 are placed in a bucket indexed by a signature 450 based on “john doe” and a bucket indexed by a signature 450 based on “dj,” respectively. Because a signature 450 of an object is generated based on the associated normalized name value, the signature for O1 is based on “john doe” and the signature for O3 is based on “dj,” as shown in FIG. 5(d). As a result, the grouper 410 places O1 and O3 into different buckets. O5 is grouped into a bucket indexed by a signature 450 based on “john henry,” the associated normalized name value.
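As a non-limiting sketch, the three normalization rules named above (punctuation removal, lowercase conversion, and stop word removal) could be applied to a name value as shown below. The function name and the two-word stop list are assumptions chosen for illustration; an actual stop list would likely be larger.

```python
import string

# Assumed, deliberately tiny stop word list taken from the examples in the text.
STOP_WORDS = {"the", "is"}

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize_name(name):
    """Remove punctuation, lowercase the string, and drop stop words."""
    without_punctuation = name.translate(_PUNCTUATION_TABLE)
    words = without_punctuation.lower().split()
    return " ".join(word for word in words if word not in STOP_WORDS)
```

Under these assumptions, normalize_name("John Doe") yields "john doe" and normalize_name("The Relativity,") yields "relativity", so two objects whose name values differ only in case, punctuation, or stop words receive the same signature.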

Alternatively, the grouper 410 groups objects based on several fact values associated with the object 430. For example, under one such grouper 410, objects 430 with the same name value and birthday value are grouped into the same bucket 460.

In another embodiment, a grouper 410 can be a function or a module. The system selects the grouper 410 from a collection of grouper functions/modules. The collection of grouper functions/modules includes functions/modules provided by a third party, such as commercially available software libraries for software development, and functions/modules previously created.

By selecting different grouper functions/modules, the janitor 110 can detect duplicate objects created from incomplete/inaccurate data more accurately. Objects 430 created from incomplete data may not share facts, even if they represent the same entity. For example, an object 430 representing George Washington created based on a webpage devoted to his childhood may not have facts about his senior years, while another object 430 also representing George Washington created based on a webpage dedicated to his years of presidency probably would not have facts about his childhood. Similarly, facts created from different sources may not share the same values due to inaccurate data, even if the associated objects represent the same entity. As a result, no single grouper 410 can accurately and consistently group duplicate objects into the same bucket 460. By providing the ability to select a grouper 410, the janitor 110 can reuse the existing well-tested functions/modules, and select groupers 410 based on the specific needs.

For example, as illustrated in FIGS. 5(c) and 5(e), one grouper 410 properly groups O2 and O4 together, but mistakenly places O1 and O3 into different buckets, and another grouper 410 properly groups O1 and O3 together, but not O2 and O4. By providing the flexibility of selecting different grouper functions/modules, the janitor 110 can process the objects multiple times, each time selecting a different grouper function/module and matching duplicate objects based on the grouping. Using multiple groupers 410 detects duplicate objects more accurately than using any single grouper 410 alone.

There are many ways for the janitor 110 to select a grouper function/module. For example, the janitor 110 can select the grouper 410 based on a predetermined system configuration. Alternatively, the selection can be determined at run time based on information such as the result of a previous attempt to identify duplicate objects. For example, if many objects do not have the fact(s) looked at by the previously selected grouper, the janitor 110 selects a grouper 410 based on different fact(s).
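One possible run-time selection rule, offered only as a sketch, is to measure how many objects have the fact used by the previously selected grouper and to switch to a different fact when that coverage is low. The coverage measure, the 0.5 threshold, and the function names below are assumptions; the text does not specify how "many objects" is quantified.

```python
def coverage(objects, attribute):
    """Fraction of objects that have a fact with the given attribute."""
    if not objects:
        return 0.0
    return sum(1 for facts in objects if attribute in facts) / len(objects)


def select_grouper_attribute(objects, candidates, previous, min_coverage=0.5):
    """Keep the previous attribute if enough objects have it; otherwise switch."""
    if coverage(objects, previous) >= min_coverage:
        return previous
    others = [attribute for attribute in candidates if attribute != previous]
    if not others:
        return previous
    # Assumed tie-breaker: prefer the alternative present on the most objects.
    return max(others, key=lambda attribute: coverage(objects, attribute))


# Illustrative objects: only one of three has a phone number fact.
objects = [
    {"name": "John Doe", "phone number": "(703) 123-4567"},
    {"name": "Relativity"},
    {"name": "John Henry"},
]
next_attribute = select_grouper_attribute(
    objects, ["phone number", "name"], previous="phone number"
)
```

Here the phone number fact is present on only a third of the objects, so the sketch falls back to grouping by name.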

After all objects are grouped into buckets 460, for every bucket 460 created, the janitor 110 applies 320 a matcher 420 to every two objects in the bucket 460, and identifies 330 the matching objects 470 as duplicate objects. The matcher 420 is designed to match duplicate objects based on the similarity of facts with the same attribute associated with the two objects (also called simply common facts). Similarity between two corresponding facts can be determined in a number of ways. For example, two facts are determined to be similar when the fact values are identical. In another example, two facts can be determined to be similar when the fact values are lexically similar, such as “U.S.A.” and “United States.” Alternatively, two facts are determined to be similar when the fact values are proximately similar, such as “176 pounds” and “176.1 pounds.” In another example, two facts are determined to be similar when the fact values are similar based on a string similarity measure (e.g., edit distance, Hamming Distance, Levenshtein Distance, Smith-Waterman Distance, Gotoh Distance, Jaro Distance Metric, Dice's Coefficient, and Jaccard Coefficient, to name a few).
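For illustration, three of the similarity tests mentioned above (identical values, proximately similar values, and a string similarity measure) might be sketched as follows. The tolerance, the distance threshold, and the assumption that a measured value is written as a number followed by a space and a unit are choices made for this sketch only. Lexical similarity such as “U.S.A.” and “United States” would require additional knowledge (for example, an alias table) and is not covered here.

```python
def identical(a, b):
    """Two fact values are similar when they are exactly equal."""
    return a == b


def proximately_similar(a, b, relative_tolerance=0.01):
    """Compare values such as '176 pounds' and '176.1 pounds' numerically."""
    number_a, _, unit_a = a.partition(" ")
    number_b, _, unit_b = b.partition(" ")
    if unit_a != unit_b:
        return False
    try:
        x, y = float(number_a), float(number_b)
    except ValueError:
        return False
    return abs(x - y) <= relative_tolerance * max(abs(x), abs(y))


def levenshtein(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def string_similar(a, b, max_distance=2):
    """Treat two strings as similar when few single-character edits separate them."""
    return levenshtein(a, b) <= max_distance
```

A matcher 420 could accept any of these functions as its similarity test, which is one way to keep the choice of measure independent of the matching rule.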

For example, the matcher 420 determines whether two objects match based on the number of common facts with similar values (also called simply similar common facts) and the number of common facts with values that are not similar (also called simply dissimilar common facts). In one such matcher 420, two objects are deemed to match when there are more similar common facts than dissimilar common facts. Applying the above matcher 420 to the buckets shown in FIG. 5(c), O1 and O3 are determined to match because there are two similar common facts: phone number and type, and only one dissimilar common fact: name. As a result, the janitor 110 properly identifies O1 and O3 as duplicate objects.

In another example, the matcher 420 determines whether two objects match based on the proportion of similar common facts among all common facts.
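The two counting-based matchers described above might be sketched as follows. Objects are represented as dictionaries with one value per attribute, similarity is taken to mean identical values, and the 0.6 ratio is an arbitrary threshold; all three are assumptions of this sketch, as are the name values given to O1 and O3.

```python
def split_common_facts(a, b):
    """Return the attributes of similar and dissimilar common facts."""
    common = a.keys() & b.keys()
    similar = {attribute for attribute in common if a[attribute] == b[attribute]}
    return similar, common - similar


def majority_matcher(a, b):
    """Match when similar common facts outnumber dissimilar common facts."""
    similar, dissimilar = split_common_facts(a, b)
    return len(similar) > len(dissimilar)


def proportion_matcher(a, b, min_ratio=0.6):
    """Match when the share of similar common facts reaches min_ratio."""
    similar, dissimilar = split_common_facts(a, b)
    total = len(similar) + len(dissimilar)
    return total > 0 and len(similar) / total >= min_ratio


# Illustrative objects modeled on O1 and O3; the name values are assumed.
o1 = {"type": "human", "name": "John Doe", "phone number": "(703) 123-4567"}
o3 = {"type": "human", "name": "DJ", "phone number": "(703) 123-4567"}
```

With these objects there are two similar common facts (type and phone number) and one dissimilar common fact (name), so both matchers report a match; raising min_ratio above two thirds would make the proportion-based matcher reject the pair.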

Alternatively, the matcher 420 can determine whether two objects match based on one or a combination of associated facts. In one such matcher 420, two objects are deemed to match when a fact with attribute ISBN is a common fact, and the associated ISBN values are identical. Applying this matcher 420 to the buckets shown in FIG. 5(e), O2 and O4 are determined to match. As a result, the janitor 110 properly identifies O2 and O4 as duplicate objects.
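A matcher of this kind can be expressed, as a sketch, by a small factory that builds a matcher from one or more required attributes. The factory name and the sample objects are hypothetical; in particular, the fact values of O2 and O4 beyond those quoted in the text are assumed.

```python
def make_attribute_matcher(*attributes):
    """Build a matcher requiring identical values for every listed attribute."""

    def matcher(a, b):
        return all(
            attribute in a and attribute in b and a[attribute] == b[attribute]
            for attribute in attributes
        )

    return matcher


isbn_matcher = make_attribute_matcher("ISBN")

# Illustrative objects modeled on O2 and O4; values are assumptions.
o2 = {"name": "Relativity", "ISBN": "0517884410"}
o4 = {"type": "book", "name": "Relativity", "ISBN": "0517884410"}
```

Passing several attributes, for example make_attribute_matcher("name", "birthday"), gives a matcher based on a combination of facts.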

Alternatively, the matcher 420 can determine whether two objects match based on the entropies of matching common facts. Entropy is a measure of randomness in a fact value, and can be used to determine the importance of matching (or mismatching) common facts in determining whether two objects are distinct or duplicates. For example, matching facts with attributes such as Social Security Number and ISBN are more significant than matching facts with attributes such as gender and nationality, and thus have higher entropies. Examples of how to calculate entropy and use entropy in identifying duplicate objects can be found in U.S. Utility patent application Ser. No. 11/356,765 for “Attribute Entropy as a Signal in Object Normalization,” by Jonathan Betz, et al., filed Feb. 17, 2006. In one such matcher 420, if the sum of entropies of matching common facts is over a threshold, the matcher 420 determines the two objects match.
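The threshold rule in the last sentence above could be sketched as follows. The per-attribute entropy values and the threshold are invented for illustration and are not taken from the referenced application, which describes how entropies are actually calculated; matching is taken here to mean identical values.

```python
# Hypothetical entropies per attribute; real values would be computed elsewhere.
ATTRIBUTE_ENTROPY = {
    "Social Security Number": 10.0,
    "ISBN": 10.0,
    "name": 4.0,
    "nationality": 2.0,
    "gender": 1.0,
}


def entropy_matcher(a, b, threshold=8.0, entropies=ATTRIBUTE_ENTROPY):
    """Match when the summed entropy of matching common facts exceeds threshold."""
    matching = [
        attribute for attribute in a.keys() & b.keys() if a[attribute] == b[attribute]
    ]
    total = sum(entropies.get(attribute, 0.0) for attribute in matching)
    return total > threshold
```

Under these assumed values, two objects that agree only on gender and nationality accumulate an entropy of 3.0 and do not match, whereas agreement on ISBN alone is sufficient.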

In another embodiment, the janitor 110 does not first apply the matcher 420 to every two objects in the bucket 460 and then identify the matching objects 470 as duplicate objects. Instead, the janitor 110 applies the matcher 420 to two objects in the bucket 460. If the matcher 420 indicates the two objects to be matching objects 470, the janitor 110 merges them, keeps the merged object in the bucket 460, and removes the other object(s) from the bucket 460. Then, the janitor 110 restarts the process by applying the matcher 420 to two objects in the bucket 460 that have not been matched before. This process continues until the matcher 420 has been applied to every pair of objects in the bucket 460.
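The merge-as-you-go process described above might be sketched as follows. The function name and the matcher and merge callables are placeholders supplied by the caller. As a simplification, this sketch re-examines all remaining pairs after each merge rather than tracking which pairs have already been compared.

```python
from itertools import combinations


def merge_within_bucket(bucket, matcher, merge):
    """Repeatedly merge the first matching pair until no pair in the bucket matches."""
    objects = list(bucket)
    while True:
        for i, j in combinations(range(len(objects)), 2):
            if matcher(objects[i], objects[j]):
                merged = merge(objects[i], objects[j])
                # Keep the merged object and drop the two originals.
                objects = [obj for k, obj in enumerate(objects) if k not in (i, j)]
                objects.append(merged)
                break  # restart the pairwise scan over the updated bucket
        else:
            # The scan finished without finding a match: every pair was compared.
            return objects
```

Each merge reduces the number of objects in the bucket by one, so the loop terminates after at most one merge fewer than the number of objects originally in the bucket.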

The janitor 110 can merge two objects in several different ways. For example, the janitor 110 can choose one of the two objects as the merged object, add facts only present in the other object to the merged object, and optionally reconcile the dissimilar common facts of the merged object. Alternatively, the janitor 110 can create a new object as the merged object, and add facts from the two matching objects to the merged object.
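Both merge strategies can be sketched briefly, again assuming objects are dictionaries keyed by attribute. In the first function the chosen object's value is kept for dissimilar common facts and the optional reconciliation step is omitted; in the second, the new object keeps every value from both sources as a set. These representational choices and the function names are assumptions of the sketch.

```python
def merge_into_first(primary, other):
    """Use primary as the merged object and add facts present only in other."""
    merged = dict(primary)
    for attribute, value in other.items():
        merged.setdefault(attribute, value)
    return merged


def merge_into_new(a, b):
    """Create a new object holding the facts of both matching objects."""
    merged = {}
    for facts in (a, b):
        for attribute, value in facts.items():
            merged.setdefault(attribute, set()).add(value)
    return merged


# Illustrative objects; values not quoted in the text are assumptions.
o1 = {"type": "human", "name": "John Doe", "phone number": "(703) 123-4567"}
o3 = {"type": "human", "name": "DJ", "birthday": "1970-01-01"}
```

The first strategy yields a single name value for the merged object, while the second retains both name values so that a later step can decide between them.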

In another embodiment, just as a grouper 410, a matcher 420 can be a function or a module. The system selects the matcher 420 from a collection of matcher functions/modules. The collection of matcher functions/modules includes functions/modules provided by a third party and functions/modules previously created. By providing the ability to select a matcher 420, the janitor 110 can reuse the existing well-tested functions/modules, and select matchers 420 based on the specific needs.

As stated above, one matcher properly matches O2 and O4, but not O1 and O3, and another matcher properly matches O1 and O3, but not O2 and O4. By providing the flexibility of selecting different matcher functions/modules, the janitor 110 can process the objects multiple times, each time selecting a different grouper-matcher combination, and identify duplicate objects more accurately.
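Processing the objects several times with different grouper-matcher combinations might look like the following sketch. Each pass is modeled as a pair of callables (a signature function and a matcher); this pairing, the helper names, and the sample fact values are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def find_duplicates(objects, passes):
    """Run each (signature function, matcher) pass and collect duplicate pairs."""
    duplicates = set()
    for signature_of, matcher in passes:
        buckets = defaultdict(list)
        for object_id, facts in objects.items():
            signature = signature_of(facts)
            if signature is not None:  # skip objects lacking the needed fact
                buckets[signature].append(object_id)
        for members in buckets.values():
            for a, b in combinations(members, 2):
                if matcher(objects[a], objects[b]):
                    duplicates.add(frozenset((a, b)))
    return duplicates


def same(attribute):
    """Matcher requiring both objects to share an identical value for attribute."""
    return lambda a, b: attribute in a and a.get(attribute) == b.get(attribute)


# Illustrative objects; values not quoted in the text are assumptions.
objects = {
    "O1": {"type": "human", "name": "John Doe", "phone number": "(703) 123-4567"},
    "O2": {"name": "Relativity", "ISBN": "0517884410"},
    "O3": {"type": "human", "name": "DJ", "phone number": "(703) 123-4567"},
    "O4": {"type": "book", "name": "Relativity", "ISBN": "0517884410"},
}
passes = [
    (lambda facts: facts.get("phone number"), same("type")),
    (lambda facts: facts["name"].lower() if "name" in facts else None, same("ISBN")),
]
duplicates = find_duplicates(objects, passes)
```

The first pass finds the pair O1 and O3 and the second finds O2 and O4, so the union over both passes covers duplicates that neither combination detects alone.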

There are many ways for the janitor 110 to select a matcher function/module. For example, the janitor 110 can select the matcher 420 based on system configuration data. Alternatively, the selection can be determined at run time based on information such as the grouper 410 selected. For example, if the resulting buckets of the grouper 410 include many objects, the janitor 110 selects a matcher function/module requiring a higher entropy threshold.

Duplicate objects are objects representing the same entity but each having a different object ID. After identifying 330 the matching objects as duplicate objects, the janitor 110 can merge the duplicate objects into a merged object, so that each entity is represented by no more than one object and each fact that is associated with the same entity will have the same object ID.

Finally, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

1. A computer-implemented method of identifying duplicate objects in a plurality of objects, the method comprising: generating signatures for two or more objects of the plurality of objects based on facts associated with the two or more objects, wherein the plurality of objects is included in a fact repository, wherein signatures for duplicate objects are identical even if at least some facts associated with the duplicate objects are different, and wherein generating a respective signature for a respective object based on facts associated with the respective object includes: retrieving, from the fact repository, a type value associated with a type attribute of a fact associated with the respective object, wherein the type value categorizes an entity represented by the respective object; selecting an attribute of the fact associated with the respective object based on the type value of the fact; and generating the respective signature based on a value associated with the selected attribute, wherein each of the plurality of objects is associated with one or more facts, each of the one or more facts having a value, and wherein generating the respective signature comprises generating the respective signature for each of the two or more of the plurality of objects by deriving the respective signatures from the value of at least one associated fact of the each of the two or more of the plurality of objects; grouping the two or more objects of the plurality of objects into a plurality of buckets based on the generated signatures, wherein the grouping comprises, responsive to an identifier of an existing bucket being the same as the signature of an object, the object being one of the two or more of the plurality of objects, adding the object to the existing bucket, and otherwise establishing a new bucket including the object, an identifier of the new bucket being same as the signature of the object; applying a matcher to respective pairs of objects in one of the plurality of buckets to determine if the respective pairs of objects are duplicates; and merging one or more of the objects determined by the matcher to be duplicates.
 2. The method of claim 1, including: applying a grouper to the two or more of the plurality of objects wherein the grouper is selected from a collection of groupers, wherein grouping includes grouping the two or more of the plurality of objects into the plurality of buckets by applying the selected grouper to the two or more of the plurality of objects.
 3. The method of claim 1, further comprising: selecting the matcher from a collection of matchers, wherein applying the matcher includes applying the selected matcher to each pair of objects in one of the plurality of buckets to determine if the pair of objects are duplicates.
 4. The method of claim 1, wherein each of the plurality of objects is associated with one or more facts, and wherein at least two objects of the plurality of objects have different facts.
 5. The method of claim 1, wherein each of the one or more facts has an attribute, and wherein the at least one fact includes at least one selected from the group consisting of: fact with attribute name, fact with attribute ISBN, and fact with attribute UPC.
 6. The method of claim 1, wherein each of the plurality of objects is associated with one or more fact, each of the one or more facts having a value, and wherein applying the matcher to each pair of objects in one of the plurality of buckets comprises: for each common fact of the pair of objects, determining a similarity of the values of the common fact based on a similarity measure; and determining that the pair of objects are duplicates based on the similarity.
 7. The method of claim 6, wherein determining that the pair of objects are duplicates comprises: determining that the pair of objects are duplicates based on the number of the common facts with similar values and the number of common facts.
 8. The method of claim 6, wherein each of the one or more facts has an entropy, and wherein determining that the pair of objects are duplicates comprises: determining that the pair of objects are duplicates based on the entropies of the common facts with similar values.
 9. A system for identifying duplicate objects in a plurality of objects, comprising: a processor for executing programs; and a subsystem executable by the processor, the subsystem including: instructions for generating signatures for two or more objects of the plurality of objects based on facts associated with the two or more objects, wherein the plurality of objects is included in a fact repository, wherein signatures for duplicate objects are identical even if at least some facts associated with the duplicate objects are different, and wherein the instructions for generating a respective signature for a respective object based on facts associated with the respective object include instructions for: retrieving, from the fact repository, a type value associated with a type attribute of a fact associated with the respective object, wherein the type value categorizes an entity represented by the respective object; selecting an attribute of the fact associated with the respective object based on the type value of the fact; and generating the respective signature based on a value associated with the selected attribute, wherein each of the plurality of objects is associated with one or more facts, each of the one or more facts having a value, and wherein the instructions for generating the respective signature comprises instructions for generating the respective signature for each of the two or more of the plurality of objects by deriving the respective signature from the value of at least one associated fact of the each of the two or more of the plurality of objects; instructions for grouping the two or more objects of the plurality of objects into a plurality of buckets based on the generated signatures, wherein the instruction for grouping include instructions for, responsive to an identifier of an existing bucket being the same as the signature of an object, the object being one of the two or more of the plurality of objects, adding the object to the existing bucket, and otherwise establishing a new bucket including the object, an identifier of the new bucket being same as the signature of the object; instructions for applying a matcher to respective pairs of objects in one of the plurality of buckets to determine if the respective pairs of objects are duplicates; and instructions for merging one or more of the objects determined by the matcher to be duplicates.
 10. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism including: instructions for generating signatures for two or more objects of the plurality of objects based on facts associated with the two or more objects, wherein the plurality of objects is included in a fact repository, wherein signatures for duplicate objects are identical even if at least some facts associated with the duplicate objects are different, and wherein the instructions for generating a respective signature for a respective object based on facts associated with the respective object include instructions for: retrieving, from the fact repository, a type value associated with a type attribute of a fact associated with the respective object: wherein the type value categorizes an entity represented by the respective object; selecting an attribute of the fact associated with the respective object based on the type value of the fact; and generating the respective signature based on a value associated with the selected attribute, wherein each of the plurality of objects is associated with one or more facts, each of the one or more facts having a value, and wherein the instructions for generating the respective signature comprises instructions for generating the respective signature for each of the two or more of the plurality of objects by deriving the respective signature from the value of at least one associated fact of the each of the two or more of the plurality of objects; instructions for grouping the two or more objects of the plurality of objects into a plurality of buckets based on the generated signatures, wherein the instructions for grouping include instructions for, responsive to an identifier of an existing bucket being the same as the signature of an object, the object being one of the two or more of the plurality of objects, adding the object to the existing bucket, and otherwise establishing a new bucket including the object, an identifier of the new bucket being same as the signature of the objects; instructions for applying a matcher to respective pairs of objects in one of the plurality of buckets to determine if the respective pairs of objects are duplicates; and instructions for merging one or more of the objects determined by the matcher to be duplicates.
 11. The system of claim 9, including: instructions for applying a grouper to the two or more of the plurality of objects wherein the grouper is selected from a collection of groupers, wherein grouping includes grouping the two or more of the plurality of objects into the plurality of buckets by applying the selected grouper to the two or more of the plurality of objects.
 12. The system of claim 9, including: instructions for selecting the matcher from a collection of matchers, wherein applying the matcher includes applying the selected matcher to each pair of objects in one of the plurality of buckets to determine if the pair of objects are duplicates.
 13. The system of claim 9, including: instructions for adding the object to the existing bucket responsive to an identifier of an existing bucket being the same as the signature of an object, the object being one of the two or more of the plurality of objects; and instructions for establishing a new bucket including the object, an identifier of the new bucket being same as the signature of the object, responsive to an identifier of an existing bucket not being the same as the signature of an object, the object being one of the two or more of the plurality of objects.
 14. The system of claim 9, including: instructions for determining a similarity of the values of a common fact based on a similarity measure for each common fact of the pair of objects; and instructions for determining that the pair of objects are duplicates based on the similarity.
 15. The computer program product of claim 10, including: instructions for applying a grouper to the two or more of the plurality of objects wherein the grouper is selected from a collection of groupers, wherein grouping includes grouping the two or more of the plurality of objects into the plurality of buckets by applying the selected grouper to the two or more of the plurality of objects.
 16. The computer program product of claim 10, including: instructions for selecting the matcher from a collection of matchers, wherein applying the matcher includes applying the selected matcher to each pair of objects in one of the plurality of buckets to determine if the pair of objects are duplicates.
 17. The computer program product of claim 10, including: instructions for adding the object to the existing bucket responsive to an identifier of an existing bucket being the same as the signature of an object, the object being one of the two or more of the plurality of objects; and instructions for establishing a new bucket including the object, an identifier of the new bucket being same as the signature of the object, responsive to an identifier of an existing bucket not being the same as the signature of an object, the object being one of the two or more of the plurality of objects.
 18. The computer program product of claim 10, including: instructions for determining a similarity of the values of a common fact based on a similarity measure for each common fact of the pair of objects; and instructions for determining that the pair of objects are duplicates based on the similarity.